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(N Abstract 

c3 We present a computational approach for generating Markov bases for multi-way contin- 

gency tables whose cells counts might be constrained by fixed marginals and by lower and 
^ upper bounds. Our framework includes tables with structural zeros as a particular case. In- 

{Sj stead of computing the entire Markov basis in an initial step, our framework finds sets of local 

moves that connect each table in the reference set with a set of neighbor tables. We construct 
Q a Markov chain on the reference set of tables that requires only a set of local moves at each 

iteration. The union of these sets of local moves forms a dynamic Markov basis. We illustrate 
the practicality of our algorithms in the estimation of exact p- values for a three-way table with 
structural zeros and a sparse eight-way table. Computer code implementing the methods de- 
scribed in the article as well as the two datasets used in the numerical examples are available 
as supplemental material. 

Keywords: Contingency tables, Exact tests, Markov bases, Markov chain Monte Carlo, Struc- 
tural zeros. 
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1 Introduction 

m 
o 

y—i Sampling from sets of contingency tables is key for performing exact conditional tests. Such tests 

arise by eliminating nuisance parameters through conditioning on their minimal sufficient statistics 
( |Agresti, 1992[ ). They are needed when the validity of asymptotic approximations to the null dis- 

^ tributions of test statistics of interest is questionable or when no such approximations are available. 



Kreiner (1987) argues against the use of large-sample \" approximations for goodness-of-fit tests 
for large sparse tables, while Haberman ( 1988) raises similar concerns for tables having expected 
cell counts that are small and large. The problem is further compounded by the existence of struc- 
tural zeros or by limits on the values allowed on each cell, e.g. occurrence matrices in ecological 
studies ( |Chen et al, 2005| ). 

One of the earlier algorithms for sampling two-way contingency tables with fixed row and 



column totals is due to Mehta and Patel ( 1983). Other key developments include the importance 
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sampling approaches of Booth and Butler ([1999]), Chen et al. ( [2005] ), Chen et al. ( |2006| ) and 
Dinwoodie and Chen ( 2010| ). Various Markov chain algorithms have been proposed by Besag and 
Clifford (19891 ), Guo and Thompson (19921 ), Forster et al. ( fT996| ) and Caffo and Booth pOOl) . A 
very good review is presented in Caffo and Booth (2003 1. 

One of the central contributions to the literature was the seminal paper by Diaconis and Sturm- 
fels (1998). They generate tables in a reference set T through a Markov basis. The fundamental 
concept behind a Markov basis is easily understood by considering all the possible pairwise dif- 
ferences of tables in T, i.e. M. = {n' — n" : n', n" E T}. The elements of M. are called moves. 
Any table n' E T can be transformed in another table n" E T by applying the move n" — n' E M.. 
Clearly not all the moves in M. are needed to connect any two tables in T through a series of 
moves. A Markov basis for T is obtained by eliminating some of the moves in M. such that the 
remaining moves still connect T. Generating a Markov basis is in the most general case a com- 
putationally difficult task that is solved using computational algebraic techniques. The simplest 
Markov basis contains only moves with two entries equal to 1, two entries equal to —1 and the 
remaining entries equal to zero. It connects all the two-way tables with fixed row and columns 
totals (Diaconis and Sturmfels, 1998). These primitive moves extend to decomposable log-linear 
models as described in Dobra (2003). A divide-and-conquer technique for the determination of 
Markov bases for reducible log-linear models is given in Dobra and Sullivant ( |2004j ). Additional 
information on Markov bases can be found in Drton et al. ( j2009[ ). 

In this paper we focus on the general problem of the determination of a Markov basis for sets 
of multi-way tables defined by fixed marginals and by lower and upper bounds constraints on each 
cell count. Bounds constraints arise in disclosure limitation from information deemed to be public 
at a certain time (Willenborg and de Waal, 2000). In ecological inference lower bounds constrains 
are induced by individual-level information ( Wakefield, 2004[ ). Noteworthy theoretical contribu- 
tions on Markov bases for bounded tables include Aoki and Takemura (2005), Rapallo (2006), 
Rapallo and Rogantin ( [20071 ), A °ki and Takemura (|20T0j), and Rapallo and Yoshida (|20T0j). Un- 
fortunately, it is quite difficult to carry out a principled assessment of the practical value of their 
algebraic statistics results for tables with more than two dimensions due to the absence of dedi- 
cated software that would make these methods accessible to lay users. 

So far the papers dedicated to Markov bases have attempted to generate them in a preliminary 
step that needs to be completed before the corresponding random walk can be started. In practice 
this step can be computationally prohibitive to perform because the resulting Markov bases con- 
tain a very large number of elements even for three-way tables (De Loera and Onn, 2005[ ). The 
Markov bases repository of Kahle and Rauh (http : / /mbdb . mis . mpg . de) is very useful for 
understanding the complexity of the moves even for simple, non-decomposable log-linear models. 
We avoid this major computational hurdle by developing dynamic Markov bases. Such bases do 
not have to be generated in advance. Instead, at each iteration of our Markov chain algorithm we 
sample from a set of local moves that connect the table that represents the current state of the chain 
with a set of neighbor tables. Our computational approach extends the applicability of Markov 
bases to examples that could not be handled with other approaches presented in the literature. 

The structure of the paper is as follows. In Section [2] we present the notations and the setting 
of our framework. In Section [3] we introduce dynamic Markov bases and present two algorithms 
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for sampling multi-way tables. In Section [4] we discuss these algorithms in the context of the im- 
portance sampling approaches of Booth and Butler ( |1999 ) and Chen et al. (2006). In Sections 
[5] and [6] we give our Markov chain algorithm based on dynamic Markov bases. In Section [7] we 
illustrate the applicability of our methodology for a three-way table with structural zeros and a 
sparse eight-way table. In Section[8]we make concluding remarks. 



2 Notations and Framework 

Let X = (Xi,X 2 , . . . ,Xk) be a vector of discrete random variables. Variable Xj takes values 
Xj £ Xj = {1, 2, . . . , Ij}, Ij > 2. Consider a contingency table n = {n(i)}i<=z of observed 
counts associated with X, where X = x* =1 Ij are cell indices. The set X is assumed to be ordered 
lexicographically, so that J = {i 1 , i 2 , . . . , i mc }, where i l — (1, . . . , 1, 1), i 2 — (1, . . . , 1, 2), i mc = 
(Ji, J 2 , . . . , Ik) and m c = fx • J 2 • . . . • Jfc is the total number of cells. With this ordering the k- 
dimensional array n = {n(i)}i^x is written as a vector n = {n(i 1 ) , n(i 2 ) , . . . , n(i mc )}. For C C 
K = {1, . . . , k}, the C-marginal nc = {nc{ic)}i c eZc °f n ls me cross-classification associated 
with the sub-vector Xc of X, where Ic = Xj<=cZj- The grand total of n is n§. 

Consider two other fc-way tables n L and n u that define lower and upper bounds for n. The 
role of these bounds is to specify various constraints that might exist for the cell entries of n. For 
example, a structural zero in cell i G lis specified as n L {i) = n u [i) = 0. Zero-one tables are 
expressed by taking n L {i) = and n u (i) = 1 for all i E Z. In addition to the bounds constraints, 
the cell entries of n can be required to satisfy a set of linear constraints induced by a set of fixed 
marginals {n c ■ C £ C}, where C = {Ci, . . . , C q }, with Cj C K for j = 1, . . . , q. We let A be a 
log-linear model whose minimal sufficient statistics are {n c : C £ C}. We define the set of tables 
that are consistent with the minimal sufficient statistics of A and with the bounds n L and n u : 

T = jn' = : n' c . = n c ., for j = 1,. . . ,q,n L (i) < n'(i) < n u (i), for i £ j} . (1) 

We assume that n £ T, that is, the bounds constraints n L and n* 7 are not at odds with the observed 
data. The set T induces bounds constraints L(i) and U (i) on each cell entry n(i), i £ Z: 

L(i) = min{?7/(z) : ri' £ T} , U(i) = max{n'(i) : n £ T} . 

These bounds are possibly tighter than the initial bounds n L and n u , i.e. 

< n L (i) < L(i) < n'(i) < U(i) < n u (i) < n , 



for i £ X and n' £ T. They can be determined by integer programming algorithms (Boyd and 



Vandenberghe, 2004) or by other methods such as the generalized shuttle algorithm (Dobra and 
Fienberg, 2010). The constraints that define T can lead to the exact determination of some cell 
counts. More explicitly, we consider S C X to be the set of cells such that L{i) < U(i). This 
means that all the tables in T have the same counts for the cells in X \ S. We note that the 
determination of S needs to be made based on the bounds L = {L(i)} i€T and U = {U(i)} ieI 
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and not on n L and n u . Thus the set T comprises all the integer arrays n' that satisfy the equality 
constraints 



n. 



l> Oi ) = n c 3 {icj ) , for i Cj EXc j ,j = l,2,...,q, 
n'(i) = n(i), for i e Z\ S, 

as well as the bounds constraints 

L(i) < n'(i) < U(i), fori e S. 



(2) 



(3) 



By ordering the cell indices X lexicographically the equality constraints ([2]) can be written as a 
linear system of equations 



An' 



where A is a m r x m c matrix with elements equal to or 1, 



£ 

i=i 



(4) 

X c \ + |Z\5| and 6 = An 



is a m r -dimensional column vector. Here \E\ denotes the number of elements of a set E. In order 
to simplify the notations we subsequently assume that S = X with the understanding that the 
determination of S is key and needs to be completed before our algorithms are applied. 

Two distributions defined on T play a key role in statistical analyses. They are the uniform and 
the hypergeometric distributions 



1 

\f\ 



and Pff(n') 







-l 




Un'(i)\ 






-i -l 



E 



l\n"( 



(5) 



n"ST lie! 

for n' £ T. In the most general case, the normalizing constants of Ph (•) and Pu(-) can be com- 
puted only if T can be enumerated. Sundberg ( 1975j ) developed a formula for the normalizing 
constant of Ph(-) if A. is decomposable and there are no bounds constraints (i.e. n L {i) = and 
n u [i) = n$ for all i el). Sampling from Pu{') is relevant for estimating the number of tables in 
T (|Chen et al., 2005 ; Dobra et al., 2006) or for performing the conditional volume test (Diaconis 
and Efron, 1985). The hypergeometric distribution P#(-) arises by conditioning on the log-linear 



model A and the set of tables T under multinomial sampling. Haberman (1974) proved that the 
the log-linear interaction terms cancel out, which leads to equation ([5]). 

Sampling from Pu(-) and Ph(-) is straightforward if T can be explicitly determined, but this 
task is computationally infeasible for most real- world datasets. The goal of this paper is to de- 
velop a sampling procedure from Ph(-) and Pu(-) for any set of tables T induced by a set of fixed 
marginals and lower and upper bounds arrays. 



3 Dynamic Markov Bases 

Producing an entire Markov basis up-front is computationally expensive; it also makes random 
walks impractical for reference sets T involving sparse high-dimensional tables. Such bases con- 
tain an extremely large number of moves that are difficult to handle in the rare cases when they can 
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actually be found using an algebra package. However, one does not necessarily need to know the 
entire Markov basis in order to run a Markov chain on T. The Markov bases we introduce in this 
section are dynamic because they are not generated ahead of time. They consist of sets of moves 
that connect a given table n* G T with a set of neighbor tables nbd T (n*) C T. The union of the 
sets of neighbor tables should be symmetric (i.e., n! G nbd T (n") if and only if n" G nbd T (n')), 
and their union should connect T, i.e., 

[J {n" - ri : n" G nbd T (n')} (6) 

n'eT 

is a Markov basis for T. The moves given by the difference between a table n' and one of its 
neighbors n" G nbd T (n') are called local. 

The sets of neighbors are determined as follows. For two integers a < b, we denote (a : b) = 
{a, a + 1, ... , 6}. We define (a : 6) = if a > 6. Let A mc denote the set of all permutations of 
(1 : m c ). For a permutation 5 G A mc , we define the set of tables T<5 that is obtained by reordering 
the cell counts of tables in T according to 5. The re-ordered version n* s G T5 of n* is such that 
nJ(i J ) = n*(i 5 ^) for 1 < j < m c . The difference between T s and T relates to the ordering of 
their cells. We have T = T So where 5 £ A mc , 5 (j) = j for 1 < j < m c . 

For a table n* e T and an index s e (1 : m c ), we define the set of tables that have the same 
counts in cells {i 5 ^\ . . . , i 5 ^} as table n*: 

T 5 , s (n*s) = {n' s E T 5 : n' s = n* 5 , for j = 1, . . . , s} . (7) 

We define 7> )0 (nJ) = T 5 . We have T 5jmc (n|) = {n^}, and nj G 7> )S (nJ) for any s G (0 : m c ). The 
sets of tables Ts tS (n* s ) become smaller as the number of common cells increases, i.e. Ts :S (n* s ) D 
Ts, s i(n* s ) for < s < s' < m c . We consider the minimum and the maximum values of cell v> in 
the set of tables T SjS (n* s ), i.e. 

£«,nv(i J ) = min {n'sd 3 ) '■ n 8 ^ ^K)} > U S ,n*,s( iJ ) = maX : n S e T 5,sK)} ■ 

Remark that Ls, n *,s(i j ) = Us, n *, s (i j ) = n* s (i j ) for j G (1 : s). Determining the minimum and 
maximum values for the remaining cells without exhaustively enumerating T$ jS (n* s ) can be done 
by computing the integer lower and upper bounds induced on each cell by the constraints that 
define this set of tables. For j G ((s + 1) : m c ), Ls iTl *,s(i j ) and Us,n*,s(i j ) are the solutions of the 
linear programming problems 

minimize ±n' 5 (P) (8) 

subject to An' = b, 

L(i) <n'(i) <U(i), fori el, 
n' 5 {i j )^n* 5 {i j ) , forj = l,...,s, 
G N, fori G X. 

Here N is the set of nonnegative integers. Computationally it is quite demanding to determine the 
integer bounds Ls :fl *,s(i j ) an ^ Us, n *, s {V), hence we approximate them with the integer counterparts 
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of the real bounds Lf n , S (V) and Uf^ S {V). These real bounds are calculated by solving the 
optimization problems ^ without the constraints n'(i) G N, for i G X. In general, we have 

L s , n *, s (i j ) > \ L f,n*,s(i j )] , U^i?) < [U« n * !S (i j )\ . 

We denote by \a] and |_aj the smallest integer greater than or equal to a and the largest integer 
smaller than or equal to a, respectively. For the purpose of implementing the procedures described 
in this paper the approximation given by rounding the real bounds seems to perform well. 

We describe a method for randomly sampling a table in Tg. Algorithm [T] generates a feasible 
table by sequentially sampling the count of each cell given that the counts of the cells preceding it 
in the reordering of X defined by 5 have already been fixed. The permutation 5 defines the order in 
which the cell counts are sampled. The set of possible values of each cell are defined by the lower 
and upper bounds induced by the constraints that define T and the cell counts already determined. 
This procedure is employed at each iteration of the sequential importance sampling (SIS) algorithm 



(Chen et al., 2006; Dinwoodie and Chen, 2010) and has also been suggested, in various forms, in 



other papers (Chen et al., 2005; Dobra et al., 2006; Chen, 2007). We note that the determination 



of multi-way tables through a sequential adjustment of cell bounds appears in earlier writings such 



as Dobra (2002) who proposes a branch- and-bound algorithm for enumerating all the multi-way 



tables consistent with a set of linear and bounds constraints, as well as Dobra et al. (2003) and 



Dobra and Fienberg (2010) who develop the generalized shuttle algorithm. 



Algorithm 1 Sample a table n' s G T$ 



1: Consider a table n' 8 whose cells are currently unoccupied. 

2: Set s <- 1. 

3: while s < m c do 

4: Calculate the updated bounds for cell i s . If s = 1, set L'^ ^i 1 ) = L(i 5 ^) and U' Sn , fi (i l ) = 
U(i 5 ^ ) . Otherwise solve the linear programming problems ^ to determine the real bounds 
forcell^andsetL^,^) = \ L* n , ^(i*)] andf/^,^) = [U s R n ,^)\. 

5: ifL' 5>nl>s _ 1 (i s )>U' Stnl>s _ 1 (i s )then 

6: STOP. {The algorithm terminates without generating any table.} 
7: else 

8: ifL' 5in , s _ 1 (^) = ^ n , s _ 1 (z s )then 
9: Setn^^L'^,^). 
10: else 

( L f (i s ) m U' (i* 1 )} 

11: Sample a cell value n' s (i s ) from a discrete distribution /V <W,*-i v )k v(-)with 

support [L'^S') ■■ U' 5 , n/ , s ^(i s )). 
12: end if 

13: Go to the next cell by setting s 4— s + 1. 
14: end if 
15: end while 
16: return n' s 

Algorithm [T] ends at line 6 without returning a table if the combination of cell values chosen 
at the previous iterations does not correspond with any table in Tg. Such combinations could arise 
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because there are gaps between the bounds that correspond with integers for which there do not 
exist any tables in T associated with them. This issue has been properly recognized and discussed 



in Chen et al. (2006) who also propose conditions which they call the sequential interval property 
that check whether gaps exist for certain tables and configurations of fixed marginals. To the 
best of the authors' knowledge, there are no computational tools that implement these conditions. 
Once such tools become available, Algorithm [T] could be improved by replacing lines 5 and 6 with 
a procedure for identifying which integers in \L' Sn , s _i(i s ) '■ U' Sn , s _ 1 (i s )) actually correspond to 
least one table in T. This set of integers becomes the support of the discrete distribution from line 
11. With this refinement Algorithm [T] will always return a valid table. In the numerical examples 
from Section [7] we use the reciprocal distribution f^ ,u {v) oc 1/(1 + v) to sample a cell value at 
line 11. Other possible choices include the uniform distribution f^' u (v) = 1/(U — L + 1) or the 
hypergeometric distribution fc u {v) = 0(l+u-v) I (l+u)- 

Algorithm [T] finds any table n* 5 6 Tg with strictly positive probability 



K5j(L-u) { .){n s ) oc 



Y[ /(^,„*. s -i( iS ) ;C/ ^*, s -i( jS ))( n *(^)). (9) 



We define the neighbors of n* s as the set of tables returned by Algorithm [TJ i.e. nbd Tg (n* s ) = Tg. 
The corresponding set of local moves ([6]) is a Markov basis for Tg. Since T and Tg are in a one-to- 
one correspondence, this is also a Markov basis for T. This Markov basis is dynamic because its 
moves are sampled using Algorithm [T] from the distribution ([9]). 

Algorithm [T] returns a table in Tg only after it has computed lower and upper bounds for each 
cell in X. Calculating 2m c bounds to generate one feasible table could be quite expensive es- 
pecially for high-dimensional sparse tables. The counts of zero that characterize such tables are 
likely to make quite a few cells take only one possible value given the current values of the cells 
that have been already fixed - see lines 8 and 9. Therefore the efficiency of Algorithm 1 can be 
increased by identifying these fixed-value cells without computing bounds. We consider an array 
x = {^(z 1 ), . . . , x(i mc )}. We transform the linear system of equations Q defined by the equality 
constraints §2§ by reordering the columns i 1 , . . . , i mc of the matrix A according to 5. The reordered 
versions of A and x are Ag and xg. The column of Ag that corresponds with Xg{P) is equal with 
the column of A that corresponds with x(i s ^). An equivalent form of the linear system Q is 

A s xg = b. (10) 

We take the augmented m r x (m c + 1) matrix [Ag \ b] obtained by stacking Ag and b along side 
each other. We determine the reduced row echelon form (RREF) [Ag \ b] of [Ag \ b] using Gauss- 
Jordan elimination with partial pivoting - see, for example, Shores (2007 ). The linear system (10) 
is equivalent with 

Agx s = b, (11) 

whose number of rows m' r < min{m r ,m c } is equal with the rank of A. Since the linear system 
( [TT] ) has fewer equations than the initial linear system ([4]), it is more efficient to make use of it when 
defining the linear programming problems ([8]). A smaller number of constraints translates into 
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reduced computing times in the determination of the bounds in line 4 of Algorithm[T] Furthermore 



it is possible to re-arrange the columns of As and the coordinates of x§ such that the system ( 1 1 ) is 
written as 



I m ,xf + Afatf = b, (12) 

where xf represent the m' r bound variables of the equivalent linear systems (j4j) and ( 10), xf is the 
(m c — m' r ) -dimensional vector of free variables and Ii is the /-dimensional identity matrix. Once 
the values of the free cells xf are fixed, the values of the bound cells are immediately determined: 

xf = b-Afxf. (13) 



Algorithm 2 Sample a table n' s 6 T s (RREF version) 



Consider a table n' s whose cells are currently unoccupied. 
Find the RREF of the linear system ( fT0| ). 

Sample values for the free cells {(n' s )f : 1 < j < (m c — m' r )} as described in lines 4-13 of 
Algorithm [T] 



Determine the values of the bound cells (n' s ) B using equation ( 13 ). 
if (n' s ) B does not contain negative entries then 

return the table n' s e Ts determined by (n' s ) F and (n' s ) B 
else 

STOR {The algorithm terminates without generating a table} 
end if 

This leads us to a new version Algorithm [2] of Algorithm[TJ The successful determination of a table 
in Ts using Algorithm [2] requires the calculation of 2(m c — m' r ) bounds instead of 2m c bounds as 
in Algorithm[TJ Furthermore, the calculation of these bounds is faster because the reduced system 
(11) is used. Lines 2 and 4 of Algorithm [2] can be implemented efficiently using BLAS (Basic 



Linear Algebra Subprograms) Fortran routines for matrix manipulations, thus overall Algorithm[2] 
has a significant computational gain over Algorithm [T] We point out that the determination of the 



RREF should be done for the system ( |T0[ ) and not for the initial system ([4]) since each permutation 
of cell indices could lead to different sets of bound and free cells. Empirically we observed that 
calculating the RREF is computationally inexpensive and can be efficiently performed at each 
application of Algorithm [2j Line 5 of Algorithm [2] is needed because certain combinations of 
values for the free cells might not correspond to any table in T, in which case negative integers are 
found in one or several bound cells. When computing the lower bounds Lf n , ■ and the upper bounds 
U[ n , j for the j-th free cell (n' s )J in line 3 of Algorithm[2| we add the linear constraints associated 



with the sampled values of the first (j — 1) free cells aha make use of the reduced system ( 1 1 ) in 
the corresponding linear programming problems ([8]). The probability that Algorithm [2] samples a 
table n' s 6 Ts is strictly positive: 



II r^>^'-\(n' s )J). (14) 
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4 State of the Art 



Algorithm [T] is key for the sequential importance sampling (SIS) algorithm (Chen et al., 2006 



Dinwoodie and Chen, 2010] ). Tables from T$ are sampled from the discrete distribution given in 



equation (|9]) and are further used to calculate importance sampling estimates of various quanti- 
ties of interest. For example, when calculating exact p-values, tables sampled from the uniform 
and hypergeometric distributions Pu{-) and Ph(-) given in equation ^ are needed, but cannot 
be obtained through a direct sampling procedure. Instead, tables sampled with Algorithm [T] are 
obtained, but these tables yield reliable estimates of exact p-values only if the discrete distribution 
^ is close to the target distributions Pu(-) or Pr{-)- Various cells orderings 5 E A mc and discrete 
distributions f^ L ' u \-) lead to discrete distributions ^ that could be quite far from a desired target 
distribution on T. Unfortunately there is no well defined computational procedure that allows the 
selection of 8 E A mc and f^ L:U \-) for any set of tables T and any target distribution on T. The 
SIS algorithm as described by Chen et al. ( [2005] ), Chen et al. ( 12006] ), Chen ( [2007] ) performs well 



for many applications, but completely fails for the two numerical examples we discuss in Section 
[7] that involve a three-way table with structural zeros and a sparse eight- way binary table. In a re- 



cent contribution, Dinwoodie and Chen ( ]2010 ) propose aprocedure for sequentially updating the 
discrete distribution f^ L ' u \-) from line 11 of Algorithm [I] as a function of the previously sampled 
cell values. With this improved version of SIS they obtain more promising results for the sparse 
eight-way binary table example. However, there is no theoretical argument which shows that the 
examination of other examples will not lead to situations in which SIS does not perform well due 
to the inability of Algorithm [T] to sample tables that receive high probabilities under the target dis- 
tribution. Replacing Algorithm [T] with the more efficient Algorithm [2] in an importance sampling 
procedure leads to improved computing times, but does not solve the critical issues related to find- 
ing appropriate choices of 5 and f^ L:U \-). 



Booth and Butler (1999]) proposed another approach for sampling multi-way tables. They start 



with a log-linear model A with minimal sufficient statistics {nc : C E C} (see Section [2]) and 
consider the expected cell values pL = {^(i 1 ), . . . , fi(i mc )} under A. Their sampling method is 
designed for a reference set of tables T specified by the marginals {nc : C E C}. Since the ability 
to compute the expected cell values fi is key, their framework does not extend to sets of tables that 
are also consistent with some lower and upper bounds n L and n u . Therefore the sampling method 
of Booth and Butler ( |1999| ) has a more limited domain of applicability than Algorithms [T] and [2] 
Booth and Butler ( |1999| ) consider a permutation of cell indices 5 E A mc and partition the reordered 



cells x$ as bound cells xf and free cells xf as in equation (12). Furthermore, they assume that the 
cell counts follow independent normal distributions x$(P) ~ N(jj,(P), fx(P)), 1 < j < m c , which 
implies that the joint distribution of the free cells follows a multivariate normal distribution 



x* s ~ N mc _ m , (£ 5 ,F d J, (15) 

where V s = (^j 1J2 ) is a covariance matrix that depends on pt, and the counts in the marginals 
{nc : C E C}. Algorithm [3] outlines the method for sampling tables from T$ introduced by Booth 



and Butler (1999). 



It is worthwhile to compare how Algorithms |2] and [3] differ. A contingency table in Tg is deter- 
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Algorithm 3 Sample a table n' s e T s (Booth and Butler, 1999) 
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13 
14 



Consider a table n' s whose cells are currently unoccupied. 
Partition the cells as bound xf and free xf. 

Sample 7i ~ N ((/**) i, v{ i), the marginal distribution of (xs)f as derived from the joint dis- 
tribution ( fT5| ). 

Set (n' s )f = [jx]. Here [a] represents the nearest integer to a. 
for j = 2, . . . , m c — m' r do 

Sample from the marginal distribution of (xs)f conditional on the current values of the 



preceding free cells as derived from the joint distribution ( 15 ): 



7i ~ N (E[(a*)f | (xi)fi:c-i)) = K)fi :0 -_i))],Var[(x 5 )f | (a^fi^i)) = K)fi :(i -i))]) • 



Set «)/ =[ 7i ]. 
end for 

Determine the values of the bound cells (n' 5 ) B using equation ( 13 ) 
if (n' s ) B does not contain negative entries then 

return the table n' s 6 T,5 determined by (n' s ) F and (t^)' 8 
else 

STOP. {The algorithm terminates without generating a table} 
end if 



mined in Algorithm [2] by sequentially calculating lower and upper bounds associated with the free 
cell whose value is sampled next, which entails solving 2(m c — m' r ) optimization problems. In 
Algorithm [3] the calculation of bounds is replaced by simulations from multivariate normal distri- 
butions whose means and variances are obtained through fast matrix operations ( |Booth and Butler,| 
1999). Since the determination of bounds comes at a higher computational cost, Algorithm [3] is 



much faster than Algorithm |2j Unfortunately, Algorithm [3] gives no guarantees that it will actually 
identify any table in Tg. A necessary condition for the successful generation of a table in T$ is 
that the values sampled at lines 3 and 6 of Algorithm [3] are actually between their lower and upper 
bounds calculated at line 3 of Algorithm |2j To the best of the authors' knowledge, there does not 
exist any proof of this claim. In fact, this necessary condition is not mentioned in Booth and Butler 
(1999) or in the subsequent work of Caffo and Booth (2001). From a theoretical perspective, there 
is no justification why Algorithm [3] should successfully output a feasible table in T$. Furthermore, 
there is no justification why Algorithm [3] should be able to sample any table in Ts with strictly 
positive probability. Despite being faster, Algorithm [3] should not be preferred to Algorithm [2] 
due to its lack of theoretical underpinning. In addition, Algorithm [2] can be used to sample from 
reference sets of tables defined by bounds constraints in addition to linear constraints induced by 
fixed marginals, while Algorithm [3] cannot be used in such general situations because it relies on 
the calculations of MLEs associated with a log-linear model. 

Algorithm [3] is employed by Booth and Butler ( |1999[ ) to develop an importance sampling ap- 
proach for producing Monte Carlo estimates of exact p- values. Caffo and Booth (|2001[) slightly 
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modify Algorithm [3] by fixing a random number of free cells to develop a Markov chain algorithm 
for conditional inference. Both papers present successful applications of Algorithm [3] in generating 
feasible tables from a reference set T. However, such examples cannot substitute the need to pro- 
vide rigorous proofs justifying the applicability of Algorithm [3] Without such proofs one cannot 
know when to expect Algorithm [3] to succeed or to fail. 

For these reasons the existent literature does not seem to contain a reliable method for calcu- 
lating exact p-values that works for arbitrary multi-way tables subject to linear and bounds con- 
straints. In the next section we propose a new Markov chain algorithm that makes use of Algorithm 
[2] to sample tables from a reference set. There is a significant advantage of using Algorithm [2] in 
the context of a Markov chain algorithm as opposed to an importance sampling procedure such 
as SIS: the discrete distribution ([9]) becomes a proposal distribution for generating the candidate 
for the next state of the chain. The accuracy of the resulting exact p-values estimates is tied sig- 
nificantly less to how close the discrete distribution ([9]) is to the hypergeometric or uniform target 
distributions. Moreover, the instances in which Algorithm [2] ends without successfully generating 
a table in the reference set are thrown out in an importance sampling method. On the other hand, 
a Markov chain procedure makes use of all the output from Algorithm [2] even if no feasible table 
was identified. 



5 The Proposed Markov Chain Algorithm 

We present a Markov chain algorithm that samples from a distribution P* (■) on the reference set of 
tables T whose key component is the dynamic Markov bases introduced in Section [3] Algorithm [2] 
generates feasible tables given an ordering of the cell indices X induced by a permutation 5 £ A mc . 
The partitioning of the cells as bound and free as well as the sequence in which the values of 
the free cells are sampled are a function of the choice of 5. The linear and bounds constraints 
that define T and the sequence of free cells associated with permutations 5 translate into various 
lower and upper bounds for the possible values of a particular cell. Empirically we observed that 
some tables in T receive very high probabilities of being sampled under some permutations in 
A mc , but under other permutations the same probabilities could be very low. Characterizing the 
relationship between the discrete distribution ([9]) and a distribution on T as a function of various 
cell orderings and distributions f^ L:U \-) is a difficult problem that is currently open. The mixing 
time of a Markov chain that calls Algorithm [2] to generate candidate tables could vary considerably 
if the permutation 5 £ A mc remains fixed across iterations. Since there are no theoretical results 
that would allow one to produce cell orderings that lead to smaller mixing times, we develop a 
Markov chain with state space T x A mc with stationary distribution 

Pr(n,(J) = Pr(n | 8)Pr(8). (16) 

Conditional on 5 £ A mc \ {5 }, a table n £ T — T$ is transformed in a table rig £ T$ with the 
same cell counts but a different ordering of its cells. This implies Pr(n \ 5) = Pr(n$). Sampling 



from the joint distribution ( 16 ) is relevant in this context only if P*(-) coincides with the marginal 
distribution Pr(n) = J2s&A m Pf(ns)Pf(8). This condition is satisfied for Pr (ns) = P*(n$) if 
P*(-) is invariant to cell orderings, i.e. P*(ns) = P*(n) for all S £ A mc . The uniform and 
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hypergeometric distributions Pu{-) and Ph(-) from equation ^ are indeed order invariant. Since 
there is no reason to favor a cell ordering over another, we assume a uniform distribution UniA mc (•) 



on the set of possible permutations of X. Therefore the stationary distribution ( 16) is 



Pr(n,5) = P*(n 5 )Un\ A (6). (17) 



We note that the marginal distribution of ( fTTj ) associated with 5 is again uniform. 

We start the chain by sampling a permutation 5^ ~ UniA mc (-) and using Algorithm [2] with 
cell ordering 5^ to sample a table G T. Given a current state (n^\ 5^'), we sample a new 
permutation <5^ +1 ^ ~ UniA mc (•)• We also sample a candidate table n* (t+1) G T 5 ( t +i) from a proposal 

distribution q$(t+i) {n^ t+1) , •). If we did not obtain a feasible table (i.e., n* ^ T), the next state of the 
chain is 5 (t+1) ). If G T, the next state (n (t+1) , <5 (m) ) is (n*, <5 (t+1) ) with the Metropolis- 
Hastings probability 

• /-, p *K( t+ i))g^)K(t+i)."S + i)) l n8 , 

1 ' D / w \ TIT) ; — : f • W 

Otherwise we set n^ +1 ^ = nM\ A sufficient condition for the irreducibility of this Markov chain is 
the positivity of the instrumental distribution, i.e. 

q s (n' s , O > 0, for every (n' 5 , ri£) G T 5 x T s and 5 G A mc . (19) 

It is possible to employ Algorithm [2] to generate candidate tables n*. In this case the Markov 
chain stays at its current state if Algorithm [2] does not generate a feasible table in T. If a feasible 



candidate table n* is identified, the acceptance probability ( 18 ) is calculated based on the proposal 

distribution q S (t+i) (^* t +i) > ) = 7r 5(*+ 1 )./( i:Lr )(-)( ra i5(t+i)) _ see e q uat i° n (9). Since Algorithm 2 
can return any table in T, the positivity condition ([19]) is satisfied. Unfortunately the probability of 
proposing a candidate table at iteration t from the reference set is independent of #1 This leads to 
an erratic behavior of the Markov chain with very small acceptance rates for new candidate tables. 

A better option is to sample candidate tables n* that have a number M of cell counts in common 
with The maximum value for M is (m c — m' r — 1). Recall that (m c — m' r ) is the number 
of free cells associated with T. For a permutation 5 G A mc , we construct a proposal distribution 
qs(nf\ ■) as follows. We partition Ts \ {n s ^} in subsets of tables that have the counts of the first 
M free cells equal with the counts of the first M free cells of np but the count of the (M + l)-th 



free cell different than 



where 



m m c — mi— 1 M 

T S \{n?} = U nbd^Knf), (20) 



nbd M/ (nf) = T^(nf)\Ti )M+1 (nf), and 

T' 5jM {nf) = {n' 5 E T 5 : (n>)f = (nf)f , for j = 1, . . . , m} . 
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We remark that T' 5mc _ m , {rig) = {n s ^}. We refer to the tables in nbd^M^j ) as the neighbors of 
of order M. Any table in Tg \ {nf^} is the neighbor of rig of a particular order. For certain 

values of M G (0 : {m c — m' r — 1)), there might not exist a neighbor of nf of order M. 

We modify Algorithm[2]into Algorithm|4]such that the feasible tables it generates are neighbors 

of order M of table nf\ Line 2 of Algorithm ^guarantees that, if a feasible table is returned, then 

this table belongs to T' SM {n\ f ). By eliminating {rig from the possible values of the free 

cell ( 

n's)M+i in li ne 3, we guarantee that Algorithm |4] does not return a table that belongs to 
Tg tM+1 {n s ^). Algorithm [4] samples a table n' 5 G nbd^A/t^^) with strictly positive probability: 

m c —m' r 

< /W (.),mK I «? } ) « /fe'.- : ^'.M)\«^)M +1 } (K) ^ +i) JJ /^ str W((ni)f ). (21) 

j=M+2 



Algorithm 4 Sample a table n' s G nbd^M^a ) 

1. Consider a table whose cells are currently unoccupied. 

2. Set the counts of the first M free cells of n' s to the corresponding counts of rig, i.e. 

K)f = {nf)l forj = l,...,M. 



3. Sample the values of the remaining free cells {n'g)f, j — M + 1, . . . , m c — m' r using line 3 of 

Algorithm 2 When sampling the value of the (M + l)-th free cell, eliminate {rig )m+i from 
the set of possible values of this cell. 

4. Attempt to determine a full table n' 8 as described in lines 5-12 of Algorithm[2j 

We consider the set of all the local moves associated with rif 1 in Tg\ 

M s {nf) = {n'g-nf-.n'geTgXin?}}. (22) 



Their union U (t) _ A4${n s ^) is a Markov basis for T$. The decomposition &20\ of T$ \ {n 5 l> } as 

sets of neighbor tables of n^p of various orders translates into a corresponding decomposition of 
the set of local moves associated with nf ^ in T s : 



,(*h 



/+\ m c —m' r —l 

Ms{nf)= U =q M s , M {nf), (23) 



where M.s,Nf{n®) = jn^ — : G nbd^Mt™^) j- We dynamically generate local moves in 

M.g{rig) as follows. We consider a discrete distribution gr{-) that gives a strictly positive prob- 
ability to each integer in (0 : {m c — m' r — 1)). We draw M ~ fi'T(-) then employ Algorithm [4] 
to sample a table rig G nbd^nj^). This gives us a local move rig — G M.s,M{rip) without 
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having to determine the entire set Ms,M{n s ^). We use this procedure to sample candidate tables 
for the Markov chain algorithm with stationary distribution ( fT7| ). The corresponding instrumental 
distribution is given by 



-m'-l 



Qs(nf\n* s ) = ^ £t(^K/(^)(.),mK I nf )/r . enbd , iif(f £M. (24) 

M=0 1 ' 

We remark that n* s E nbd^ (n^) implies E nbd^K^) and £ nbds t M'(n}) for M' ^ M. 



Therefore, in order to calculate the Metropolis-Hastings acceptance ratio ( 18 ), we need to evaluate 



only one component of the mixture distribution (|24>. For rig 6 nbd^M^i'')' we nave 



This makes the computing effort needed to run the resulting Markov chain quite manageable. The 

J »« en- 



chain is irreducible because the positivity condition (1 19b is satisfied as a result of U (t) „ Aig(rig) 



being a Markov basis for Tg. 

The choice of the discrete distribution gr(-) is crucial for a good performance of the chain. 
The values of M sampled from gx(-) need to maintain a balance between making large jumps in T 
(hence being more likely to reject the move) and making small jumps in T (hence being less likely 
to reject the move, but spending many iterations around similar tables). Specifying a reasonable 
distribution gr(-) could be a daunting task since it needs to be tailored specifically for T. In the next 
section we give a coherent procedure for finding gr(-) based on a flexible algorithm for exploring 
an arbitrary target set of tables. 



6 The Algorithm for Finding gy( ) 

We present a method for producing a discrete distribution gr(-) required in the specification of the 



proposal distribution (24). Our approach is based on a repeated approximation of the number of 
free cells whose counts need to be fixed before all the other counts are uniquely determined. An 
upper bound for this number is the total number of free cells (m c — m' r ) - see Section|3] However, 
due to the particular configurations of small and large counts of the tables in T and to the presence 
of the bounds constraints ([3]), this number can actually be anywhere between 1 and (m c — m' r ). 

We assume that T contains at least two tables. We consider a permutation 5 E A rrlc and a 
table n* E T. We let Tg C X be the indices of the (m c — m' r ) free cells associated with T and 
5 - see Section || We take 5' E A rric such that . . . = J= & . Recall from 

equation (|7|) that T$i :S (n* s ,) represents the set of tables in T that have the same counts in cells 
{i s '^\ as table n*. There exists a unique index s (5', n*) E (0: (m c — m' r — 1)) such that 

(CI) Tp lS {6' ,n*){n*8') contains at least one table in T that is different than n*; 

(C2)T Wjn . )+1 (n$,) = K,}. 
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If Tsr s {n* &l ) contains a table different than n* 5 ,, there must exist j G ((s + 1) : (m c — m',)) such that 
the lower and upper bounds of the corresponding cell are different: 



< 



(25) 



This condition is necessary, but it is not sufficient. That is, equation (25) might hold while still 
Ts^ s {n* 5 ,) = {n* s ,}. As such, the computation of real bounds cannot substitute actually checking 
that Ts' :S (n* s ,) \ {n* s ,} ^ 0, but such a check is computationally expensive. We reduce this comput- 
ing effort by first determining an upper bound for s(5',n*). 

Algorithm [^performs a binary search to determine the maximum index s b (5', n*) < (m c — m' r ) 
such that 



n [I 



tjR (j&tiY 



(26) 



for j e ((s b (5',n*) + 2) 



- m' r )), but 



least one index j e ((s b (5',n*) + 1) 



5',n*,s b (6',n*) [ 



i.8(3» 



m, 



— m' r )). Condition (25) is a 



< 



for at 



<5',n*,s b (<5',n*) 

ways satisfied for s 

long as T contains at least two tables. Moreover, condition (|25j is never satisfied for s = m r — nr 
since T* m „. 



as 
i 



{n* s ,}. At the completion of Algorithm 5j the index s b (S', n*) is returned 



and we still need to determine s(5', n*). Since T s , s b(s',n*)+i( n s') = { n *s'}' condition (C2) is satis- 
fied and hence s(5', n*) < s b (5', n*). Algorithm [6] starts with s b (5', n*) as the initial guess for the 
value of s(5', n*) and sequentially decreases this guess until condition (CI) is also satisfied. 



Algorithm 5 Determination of s b (5', n*) 

1. Set si <— and s 2 <— (m c — m' r ). 

2. while si + 1 7^ s 2 do 

3. Sets^ L( s i + S 2)/2J. 

4. for j = s + 1, . . . , m c — m' r do 

5. Calculate the real lower and upper bounds Lf, n * s {i 5 ^') and U$ n * s (i s ^). 

6. end for 

7. if \Lf l>n * >s {i s M)] = lU* n * >s {iW)\ for all j = s + 1, . . . , m c - m' r then 

8. Set s 2 <- s. 

9. else 

10. Set sx <- s. 

11. end if 

12. end while 

13. Return s b {5',n*) <h- Si. 



Algorithm [6] returns a value of s that is less or equal than s(5, n'). However, for our purposes, 
this lower bound is sufficient. Algorithm [7] estimates the discrete distribution gr(-) by repeatedly 
calling Algorithms [2| [5] and [6] for a large number of iterations i ma x- The value of gr(j), j G (0 : 
(m c — m' r — 1)), is proportional with the number of iterations in which keeping j counts of free 
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Algorithm 6 Determination of a lower bound of s(S', n*) 



1. for s = s b {5', n*), s b {5', n*) - 1, . . . , do 

2. Set free cells {i s . . . , i 5 '^)} to the corresponding values from n*. 

3. Sample the values of the remaining components of the vector of free cells xf using lines 
5-14 of Algorithm [1} 

4. Attempt to determine a full table n' E T as described in lines 5-12 of Algorithm [2] 

5. if a table n' E T was determined then 

6. if n' ^ n* then 

7. return s 

8. end if 

9. end if 
10. end for 



cells fixed resulted in the successful sampling of a feasible table different than some other randomly 
generated feasible table. The initialization from line 2 of Algorithm [7J assures that the distribution 
gr(-) returned by the procedure satisfies gr(j) > for any j E (0 : (m c — m' T — 1)) which is 
a condition required to generate all the local moves from equation ( |2~2~| ). Algorithm [7J effectively 
explores the set of tables T and identifies a distribution gr(-) based on this exploration. We remark 
that we have not made any assumptions about a parametric form for gr(-)- The structure of T 
dictates the probabilities that define gr(-) which leads to a very flexible choice of gr(-) which is 
adapted to the structure of T. 



Algorithm 7 Determination of g?{ 



1. Consider a vector G with indices (1 : (m c — m' r )). 

2. Set G(j) <- 1 for j E (1 : (m c - m' r )). 

3. fori = 1,2, . . . ,i max do 
Call Algorithm [2] until it generates a random table n* E T. 

Generate a random permutation 5 E A mc and find the RREF of the linear system (10). 



4. 

5. 

6. 

7. 

8. 

9. 
10. 
11. 



Call Algorithm 5 
Call Algorithm 6 
Set G(s) <- G(s) 
end for 

Setg T (j) = G{j + l)/h 
return g T { ) 



to determine s b (S', n*). 
to determine s < s(8',n*). 
+ 1. 



x for j E (0 : 



m. 



1)). 



7 Examples 

We illustrate the use of the Markov chain algorithm with dynamic Markov bases in two examples. 
The first example involves a three-way table with structural zeros, while the second example in- 
volves a sparse eight-way table. Both examples have been chosen to show the effectiveness of 
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the Markov chain algorithm described in Sections [5] and [6] with respect to competing approaches 
proposed in the literature. For both examples, we have been unable to generate a Markov basis us- 
ing the computational algebraic techniques of Diaconis and Sturmfels (1998), which renders their 
sampling approach inapplicable. The sequential importance sampling (SIS) algorithm of Chen et 
al. ( 2006| ) is applicable, but fails to provide any meaningful results by giving estimates equal to 



1 for all the p-values we calculate. The Markov chain algorithm of Caffo and Booth ( 2001[ ) (CB, 



henceforth) as implemented in the R package exactLoglinTest ( |Caffo, 2006[ ) is not applica- 
ble for tables with structural zeros, hence it does not produce any estimates for our first example. 

We run 100 independent Markov chains of length 2500000 with a burn-in time of 25000 itera- 
tions. The chains were run with the dynamic Markov bases approach and the CB algorithm. The 
SIS algorithm was run until it generated an equal number of sampled tables. We sampled from the 
hypergeometric distribution Ph(-) and calculated estimates for the exact p-values associated with 
the X 2 and G 2 statistics for the all two-way interaction model. We run 100 replicates of Algorithm 
[7] for 100000 iterations to find the distribution gr(-) that defines the Metropolis-Hastings proposal 



distribution (24). 



We estimate the Monte Carlo error using the non-overlapping batch means method of Geyer 



( 1992 ). Each of the 100 independent chains was divided in 10 batches of size 250000. The standard 
error of an exact p-value estimate is the sample standard error of the p-value estimates correspond- 
ing with the 1000 resulting batches. The Monte Carlo errors are given after the "±" sign following 
the Monte Carlo estimate of the exact p-value. We report the computing time necessary to generate 
one batch of 250000 iterations throughout. We use OpenMPI (http : / / www . open-mpi .org/) 
to obtain batches by running independent processes on several processors. 

We performed our computations on a Mac Pro computer with 2 x 2.26 GHz quad-core Intel 
Xeon processors with 16 GB of memory. We report the mean elapsed computing time in seconds 
with standard errors calculated across the replicates. We wrote our own C++ implementation of the 



SIS algorithm by following the description from Chen et al. (2006). The tables have been sampled 
in SIS using Algorithm [T] with cell values generated from the hypergeometric distribution fh(-)- 
We implemented the algorithms described in Sections [3j [5] and [6] in C++. The linear programming 
problems @ have been solved with IBM ILOG CPLEX Opt imizer (http : / / www . ibm . com) 
routines. All the code and the datasets needed to replicate the numerical results from this section 
are available as supplemental materials. 

7.1 NBER data 

Table [T] is a 4 x 5 x 4 cross-classification of 4345 individuals by occupational groups (01 - 
"self-employed, business", 02 - "self-employed, professional", 03 - "teacher", 04 - "salary- 
employed"), aptitude levels (A) and educational levels (E). It was collected in a 1969 survey of the 



National Bureau of Economic Research (NBER) - see Table 3-6 page 45 from Fienberg (2007). 
The horizontal lines denote structural zeros. The ten structural zeros under 03 and El, E2 are 
associated with teachers being required to have higher education levels. The other two structural 
zeros under 02 can be motivated in a similar manner. 

The number of degrees of freedom for the all two-way interaction model is calculated by sub- 
tracting the number of structural zeros from 36 - the number of degrees of freedom corresponding 
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with a 4 x 5 x 4 table without structural zeros. Bishop et al. ( |1975p argue that the number of de- 
grees of freedom must be increased by the number of structural zeros that are present in marginal 
tables that are among the minimal sufficient statistics of the log-linear model considered. In this 
case there are two such counts present in the aptitude by educational levels marginal. The resulting 
number of degrees of freedom is 36 — 12 + 2 = 26. The observed value of the likelihood-ratio 
test statistic is G 2 = 15.91 which leads to an asymptotic p-value for the all two-way interactions 
model of 0.938. The observed value of the X 2 test statistic is 17.1 which leads to an asymptotic 
p-value of 0.906. 

The Markov chains with dynamic Markov bases lead to an estimate of the G 2 exact p-value of 
0.9650 ± 0.0037 and to an estimate of the X 2 exact p-value of 0.9134 ± 0.0068. Figure [j] shows 
the convergence of the Markov chain algorithm from Section [5] across the 100 chains. We remark 
that the large sample size of the NBER data leads to a good agreement between the asymptotic 
and the exact p-values. The computing time for one batch of 250000 tables for our Markov chain 
algorithm is 1882.58 ± 1.33 seconds. Figure [2] shows the estimated distribution function gr(-) ob- 
tained from 10 million iterations of Algorithm [7] The number of free cells is 26 hence its domain 
is {0,1,..., 25}. We see that the mode of g T (-) is <7t(24) = 0.604 which represents the estimated 
probability of obtaining a feasible table different than the current table after fixing the values of 24 
free cells. The running time of Algorithm [7] is 1266 ± 0.93 seconds per 100000 iterations. 



Table 1 : NBER data. The grand total of this table is 4345. 
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7.2 Rochdale data 

The data in Table [2] is a cross-classification of eight binary variables relating women's economic 
activity and husband's unemployment from a survey of households in Rochdale - see Whittaker 
(1990) page 279. The variables are as follows: a, wife economically active (no,yes); b, age of 
wife > 38 (no,yes); c, husband unemployed (no,yes); d, child < 4 (no,yes); e, wife's education, 
high-school+ (no,yes); /, husband's education, high-school+ (no,yes); g, Asian origin (no,yes); h, 
other household member working (no,yes). There are 665 individuals cross-classified in 256 cells, 
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Figure 1 : Convergence of 100 independent Markov chains based on the dynamic Markov basis for 
the NBER data and the all two-way interaction model. The upper panel shows the convergence to 
the estimate 0.9134 of the X 2 exact p-value, while the lower panel shows the convergence to the 
estimate 0.9650 of the G 2 exact p-value. The rr-axis gives the number of iterations in increments 
of 250000. For each chain we calculated p-value estimates based on 250000i sampled tables with 
i = 1,2,. ..,10. The dotted line represents the mean of these incremental estimates, while the 
solid lines represent their 2.5% and 97.5% quantiles. 
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Number or fixed cells 



Figure 2: The discrete distribution g T ( 
as determined by Algorithm [7] 



for the NBER data and the all two-way interaction model 



which means that the mean number of observations per cell is 2.6. The table has 165 counts of 
zero and 217 other cells contain at most three observations. 

Whittaker ( 1990) argues that this table is sparse and subsequently that the applicability of any 
asymptotic results relating to the limiting distributions of goodness-of-fit statistics for log-linear 
models becomes questionable due to the zeros present in marginals of dimension three or more. 
The likelihood-ratio test statistic for the all two-way interaction model is G 2 = 144.59, while the 
observed X 2 test statistic is 258.65. The all two-way interaction model has 219 degrees of free- 
dom, which leads to asymptotic p-values of 1 for the G 2 statistic and of 0.034 for the X 2 statistic. 

The Markov chain algorithm with dynamic Markov bases and the CB algorithm give simi- 
lar estimates of the exact p-values. More specifically, the exact G 2 p-value is estimated to be 
0.1668 ± 0.0684 by our approach and 0.1644 ± 0.0443 by the CB approach. The exact X 2 p-value 
is estimated to be 0.1642 ± 0.0524 by our approach and 0.1717± 0.1101 by the CB approach. The 
Monte Carlo standard errors for the G 2 p-value of both Markov chain algorithms are comparable, 
but the CB algorithm gives a larger standard error when computing the X 2 p-value. In a recent pa- 



per, Dinwoodie and Chen (2010) report two different estimates (0.223 ± 0.091 and 0.186 ± 0.041) 
of the exact G 2 p-value obtained with their new version of the SIS algorithm based on two cell 
orderings. We found estimates equal to 1 for both the G 2 and X 2 exact p-values using our imple- 



mentation of the SIS algorithm of Chen et al. (2006 1. 

Figure [3] illustrates the convergence of the Markov chain algorithm from Section [5] across its 
100 replicates. Its running time is 18821.75 ± 304.01 seconds per 250000 iterations. Our Markov 
chain algorithm makes use of an estimate of the discrete distribution gr(-) that is obtained by run- 
ning Algorithm [7] for 10 million iterations. It takes approximately 8813.6 ± 40.62 seconds per 
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100000 sampled tables to obtain the distribition gx(-) from Figure[4j 



Table 2: Rochdale data from Whittaker ( |1990[ ). The cells counts are written in lexicographical 
order with h varying fastest and a varying slowest. The grand total of this table is 665. 



K 
o 


n 

U 


9 


1 


K 

'J 


1 


n 

u 


n 
u 


4 


i 

_L 


n 

\J 


n 


a 

u 


n 
u 


9 


n 
u 


Q 
o 


n 

U 


1 1 


n 


1 


n 
u 


1 


n 


Q 

o 


n 


1 
1 


n 




u 


1 


u 


K 
•J 


n 

U 


9 


n 

u 


n 
u 


n 
u 


n 

u 


n 


n 

u 


n 

u 


n 


n 


n 

u 


u 




u 


A 

~r 


U 


8 
o 


9 


6 





1 


n 


1 

J. 


n 


1 






u 


1 

J. 


u 


1 7 


1 

J-U 


1 

J. 


1 

J. 


16 


7 


n 

u 


n 

u 


n 
u 





n 

\J 


n 

u 


1 


u 


n 
u 


n 

u 


i 


n 

u 


9 


n 

u 








n 

u 


n 

u 


i 


n 
u 


n 
u 


n 
u 


n 
u 


n 
u 


n 
u 


n 
u 


4 


1 


o 
O 


1 

1 


1 


1 


o 
z 


n 
U 


i 


n 
U 


n 
U 


n 
U 


1 


n 
U 


n 
U 


n 

u 








3 









































18 


3 


2 





23 


4 








22 


2 








57 


3 








5 


1 








11 





1 





11 











29 


2 


1 


1 


3 











4 











1 























1 


1 












































41 


25 





1 


37 26 








15 


10 








43 


22 




















2 























3 











2 


4 








2 


1 











1 








2 


1 

























































8 Conclusions 

In this paper we introduced dynamic Markov bases and proposed a Markov chain algorithm for 
sampling tables based on them. Our methods are applicable off-the-shelf to calculate exact p- 
values for reference sets of tables defined by any type of linear and bounds constraints. The choice 



of distribution gry) that is used in the mixture instrumental distribution (24) is key for a success- 
ful application of the Markov chain algorithm described in Section [5J The running time of our 
sampling approach is a function of the expected number of optimization problems ([8]), i.e. 

Q(g T ) = 2(m c -m' r -E gT (M)), 

that need to be solved to generate one candidate table from ( p4] ). In our NBER data example, 
the number of free cells is 26 which yields Q(gr) = 5.46 for the distribution gr(-) from Figure 
[51 By comparison, if we would work with the uniform distribution gr(-) = Uni( 0: ( mc _ m ;_i))(-) 



in the instrumental distribution (24), the expected number of optimization problems increases to 
Q(Uni( ;25)) = 27. For the Rochdale data example we obtain Q(gr) = 134.1 for the distribution 
gr(-) from Figure [4] and Q(Uni( 0: 2i8)) = 220. As such, Algorithm [7] is quite effective in deter- 
mining distributions gr(-) that are lead to Markov chains with dynamic Markov bases with good 
mixing properties and reasonable running times. Finding suitable distributions gr(-) that are prop- 
erly adapted to a reference set of tables T in the absence of a well-defined procedure could be 
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Figure 3: Convergence of 100 independent Markov chains based on the dynamic Markov basis for 
the Rochdale data and the all two-way interaction model. The upper panel shows the convergence 
to the estimate 0.1642 of the X 2 exact p-value, while the lower panel shows the convergence to the 
estimate 0.1668 of the G 2 exact p-value. The rr-axis gives the number of iterations in increments 
of 250000. For each chain we calculated p-value estimates based on 250000i sampled tables with 
% = 1, 2, . . . , 10. The dotted line represents the mean of these incremental estimates, while the 
solid lines represent their 2.5% and 97.5% quantiles. 



22 



0.025 - 



0.02 - 



0.015 - 



0.01 - 



0.005 - 




20 40 60 80 100 120 140 160 180 

Number of fixed cells 



Figure 4: The discrete distribution for the Rochdale data and the all two-way interaction 
model as determined by Algorithm [7] The number of free cells is 219. Remark that gr(-) is 
considerably more diffuse than the corresponding distribution for the NBER data - see Figure [2} 

detrimental in practice, hence Algorithm [7] should be seen as integral part of the dynamic Markov 
bases methodology we proposed. 

We hope that the basic idea of generating only the moves needed to complete one iteration of 
the random walk will be adopted by other researchers since it is a more practical alternative to the 
determination of the entire Markov basis in one computationally intensive step as it was originally 
suggested in Diaconis and Sturmfels (1998). Relevant questions relate to studying the theoretical 



properties of dynamic Markov bases using algebraic statistics in the spirit of Rapallo (2006), Aoki 
and Takemura (|2010[) and Rapallo and Yoshida (|2010[). These research directions should be added 



to the list of open problems related to Markov bases presented in Yoshida (2010) 



Supplemental Material 

Computer Code and Data: Supplemental materials for this article are contained in a single 
zip archive and can be obtained in a single download. This archive contains the datasets NBER 
and Rochdale (in text files) as well as the C++ source code to run the algorithms described in 
this article (the Markov chain based on the dynamic Markov bases and the sequential importance 
sampling algorithm). A detailed description of the files contained in this archive is contained in a 
README.txt file enclosed in the archive. 
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